# Computational Analysis of IKKB - PND 4 mice samples - Python File Created by : Sayane Shome Date Started : 4 April 2022 Date last updated : 7 November 2022 The first parts of code pipeline is heavily inspired from tutorial in the link : https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html where Scanpy developers have tried to recreate the clustering performed in Seurat R package.Importing the required python packages
There are 4 samples : 2 Cre + (1 male(S11) and 1 female(S12)); 2 Cre - (1 male(S13) and 1 female(S14)).Cre + samples are IKKB knockout mice cases whereas Cre - are the wild type cases. In the next step,read in the count matrices for each case generated by CellRanger(10X Genomics) into an AnnData object,which holds many slots for annotations and different representations of the data.
Add column type in the Anndata object
From scanpy manual (https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.highest_expr_genes.html) We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment. – Davis McCarthy and Aaron Lun Next code show top 20 genes that yield the highest fraction of counts in each single cell, across all cells.
Basic filtering in the next step where cells with minimum 200 expressed genes and genes which are expressed in atleast 3 cells are filtered.
From the tutorial "Let’s assemble some information about mitochondrial genes, which are important for quality control. Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017): High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane. With pp.calculate_qc_metrics, we can compute many metrics very efficiently."
The next code will generate the violin plots : A violin plot of some of the computed quality measures: -the number of genes expressed in the count matrix -the total counts per cell -the percentage of counts in mitochondrial genes
Remove cells that have too many mitochondrial genes expressed or too many total counts:
Actually do the filtering by slicing the AnnData object.
Total-count normalize (library-size correct) the data matrix 𝐗 X to 10000 reads per cell, so that counts become comparable among cells
Logarithmize the data
Identify highly-variable genes
We can get back an AnnData of the object in .raw by calling .raw.to_adata().
Actually do the filtering
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.
Scale each gene to unit variance. Clip values exceeding standard deviation 10.
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.Differential Expression of Genes #Article to check out for softwares : #https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1949-zCombine the anndata files for all the 4 cases.
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.louvain() or tSNE sc.tl.tsne(). In our experience, often a rough estimate of the number of PCs does fine.
Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of reproducing Seurat’s results, let’s take the following values.
Reason for using leiden clustering : https://www.nature.com/articles/s41598-019-41695-zLeiden clustering of the data in different resolutions
Add a new variable in the Annobject for easier processing
Plot all the UMAP plots for entire data in different resolutions
UMAP plot for Cre- showing leiden clustering at 0.1 resolution
UMAP plot for Cre+ showing leiden clustering at 0.1 resolution
UMAP plot for entire data showing leiden clustering at 0.1 resolution
The X matrix only contains the variable genes,while the raw matrix contains all genes.
For DGE analysis we would like to run with all genes, but on normalized values, so we will have to revert back to the raw matrix and renormalize.T-test
Top 25 genes in each cluster
"The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar.We recommend using the latter in publications, see e.g., Sonison & Robinson (2018). You might also consider much more powerful differential testing packages like MAST, limma, DESeq2 and, for python, the recent diffxpy." -- From Scanpy tutorial
"1.0.4 Logistic regression test As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. (2014) has more details." -- Scanpy tutorial
1.0.2 T-test overestimated_variance
Dendrogram calculation for leiden clustering at 0.1 resolution
Heatmap construction of DE genes based on Wilcoxon testing clustered based on Leiden clustering at 0.1 resolution.
Dotplot construction of DE genes based on Wilcoxon testing clustered based on Leiden clustering at 0.1 resolution.
Violin plot construction of DE genes based on Wilcoxon testing clustered based on Leiden clustering at 0.1 resolution.